class: title-slide, left, bottom

# Feedforward Neural Networks as Statistical Models

----

## **Andrew McInerney**, **Kevin Burke**

### University of Limerick

#### RSS Northern Ireland, 26 Oct 2022

???

I am funded by the SFI CRT in FDS. This is a multi-institutional collaboration between UL, UCD, and MU. The goal of the centre is to fuse and blend the fundamentals of applied mathematics, machine learning, and statistics. My research focuses on the combination of the latter two, where I'm looking at a statistical-modelling-based approach to neural networks.

So, neural networks are typically implemented as black-box models in machine learning, but, taking a statistical perspective, I want to show how these models have similarities to the models traditionally used in statistical modelling.

---

# Agenda

--

- Feedforward Neural Networks

--

- Statistical Perspective

--

- Model Selection

--

- Statistical Interpretation

--

- R Implementation

--

<br>

Slides: [bit.ly/rss-fnn-stat](https://bit.ly/rss-fnn-stat)

Code: [bit.ly/rss-fnn-stat-code](https://bit.ly/rss-fnn-stat-code)

---

class: inverse middle center subsection

# Feedforward Neural Networks

---

# Background

Attempts to model the brain.

Idea that our brains consist of synapses and neurons.

McCulloch and Pitts, Rosenblatt, Hinton.

???

Neural networks originated from early attempts to model the human brain. The idea was that if we could replicate the process that the human brain uses for learning through an algorithm, then we could achieve artificial intelligence.

Our brains are made up of over 86 billion neurons, which are connected together by over 1,000 trillion synapses, which send signals.

The idea, credited to McCulloch and Pitts, was to create a single-neuron model and use it to solve problems.

---

# Background

Interest within the statistics community in the late 1980s and early 1990s.

Comprehensive reviews provided by White, Ripley, Cheng and Titterington.
However, the majority of research took place outside the field of statistics (Breiman; Hooker and Mentch).

---

# Feedforward Neural Networks

--

.pull-left[
<img src="data:image/png;base64,#img/FNN.png" width="90%" height="110%" style="display: block; margin: auto;" />
]

<br>
<br>

--

$$
`\begin{equation}
\text{NN}(x_i) = \gamma_0+\sum_{k=1}^q \gamma_k \phi \left( \sum_{j=0}^p \omega_{jk}x_{ji}\right)
\end{equation}`
$$

---

# Motivating Example

--

### Boston Housing Data (Kaggle)

--

506 communities in Boston, MA.

--

Response:

- `medv` (median value of owner-occupied homes)

--

12 Explanatory Variables:

- `rm` (average number of rooms per dwelling)

- `lstat` (proportion of population that are disadvantaged)

---

# R Implementation: nnet

--

```r
library(nnet)
nn <- nnet(medv ~ ., data = Boston, size = 8, maxit = 5000, linout = TRUE)
summary(nn)
```

--

```{.bg-primary}
##   b->h1  i1->h1  i2->h1  i3->h1  i4->h1  i5->h1  i6->h1  i7->h1  i8->h1  i9->h1 
##    2.79    5.92    0.34    1.31    0.23   -1.31   -2.67    0.77   -0.22    1.46 
## i10->h1 i11->h1 
##    1.20    1.26 
##   b->h2  i1->h2  i2->h2  i3->h2  i4->h2  i5->h2  i6->h2  i7->h2  i8->h2  i9->h2 
##   20.53    5.59    3.52   -0.64   12.64   -5.25   -4.12    0.24    2.64    0.49 
## i10->h2 i11->h2 
##  -21.17    4.03 
## [...]
```

---

class: inverse middle center subsection

# Statistical Perspective

---

# Statistical Perspective

--

$$
y_i = \text{NN}(x_i) + \varepsilon_i,
$$

--

where

$$
\varepsilon_i \sim N(0, \sigma^2)
$$

<br>

--

$$
\ell(\theta)= -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\text{NN}(x_i))^2
$$

---

# Uncertainty Quantification

Then, as `\(n \to \infty\)`

$$
\hat{\theta} \sim N[\theta, \Sigma = \mathcal{I}(\theta)^{-1}]
$$

--

Estimate `\(\Sigma\)` using

$$
\hat{\Sigma} = I_o(\hat{\theta})^{-1}
$$

--

<br>

However, inverting `\(I_o(\hat{\theta})\)` can be problematic in neural networks.

---

# Redundancy

--

Redundant hidden nodes can lead to issues of unidentifiability for some of the parameters (Fukumizu 1996).
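This unidentifiability can be sketched in a few lines of R. The sketch below uses a hypothetical one-input network with hand-picked weights (not a model fitted in these slides): duplicating a hidden node and splitting its output weight gives a different parameter vector but an identical fitted function.

```r
# Sketch: a redundant hidden node leaves the fitted function unchanged,
# so the split of the output weight (1.2 vs 0.8 below) is unidentifiable.
phi <- function(z) 1 / (1 + exp(-z))  # logistic activation

nn_q1 <- function(x) 0.5 + 2.0 * phi(1.0 + 3.0 * x)    # q = 1
nn_q2 <- function(x) 0.5 + 1.2 * phi(1.0 + 3.0 * x) +  # q = 2: same node twice,
                           0.8 * phi(1.0 + 3.0 * x)    # output weight 2.0 split

x <- seq(-2, 2, by = 0.1)
max(abs(nn_q1(x) - nn_q2(x)))  # ~0 (up to floating point)
```

Any reallocation of the output weight between the two duplicated nodes gives the same likelihood, which is why the information matrix becomes singular.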
<br>

--

Redundant hidden nodes `\(\implies\)` singular information matrix.

<br>

--

Model selection is required.

---

class: inverse middle center subsection

# Model Selection

---

# Model Selection

<img src="data:image/png;base64,#img/FNN-ms.png" width="65%" style="display: block; margin: auto;" />

---
count: false

# Model Selection

<img src="data:image/png;base64,#img/FNN-vs.png" width="65%" style="display: block; margin: auto;" />

---
count: false

# Model Selection

<img src="data:image/png;base64,#img/FNN-vsmc.png" width="65%" style="display: block; margin: auto;" />

---

# Proposed Approach

.pull-left[
<img src="data:image/png;base64,#img/FNN1.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[
Three phases for model selection:

{{content}}
]

--

1. Hidden-node selection

{{content}}

--

2. Input-node selection

{{content}}

--

3. Fine tuning

{{content}}

---

# Proposed Approach

--

.center[
<figcaption>Hidden Node Selection</figcaption>
<img src="data:image/png;base64,#img/hidden-node-2.png" height="125px"/>
]

--

.center[
<figcaption>Input Node Selection</figcaption>
<img src="data:image/png;base64,#img/input-node-2.png" height="125px"/>
]

--

.center[
<figcaption>Fine Tune</figcaption>
<img src="data:image/png;base64,#img/fine-tune-2.png" height="125px"/>
]

---

# Objective Function

--

- Machine Learning:

--

$$
`\begin{equation}
\text{Out-of-Sample MSE} = \frac{1}{n_\text{val}}\sum_{i=1}^{n_\text{val}} (y_i - \text{NN}(x_i))^2
\end{equation}`
$$

--

- Proposed:

--

$$
`\begin{equation}
\text{BIC} = -2\ell(\hat{\theta}) + \log(n)(K + 1),
\end{equation}`
$$

--

$$
`\begin{equation}
K = (p+2)q+1
\end{equation}`
$$

---

# Simulation Setup

.pull-left[
<br>

True Model: `\(p = 3\)`, `\(q = 3\)`
]

---
count: false

# Simulation Setup

.pull-left[
<br>

True Model: `\(p = 3\)`, `\(q = 3\)`

<br>

No. unimportant inputs: `\(10\)`
]

---
count: false

# Simulation Setup

.pull-left[
<br>

True Model: `\(p = 3\)`, `\(q = 3\)`

<br>

No. unimportant inputs: `\(10\)`

<br>

Max no. hidden nodes: `\(10\)`
]

--

.pull-right[
<img src="data:image/png;base64,#img/simFNN.png" width="90%" style="display: block; margin: auto;" />
]

---

# Simulation Results: Approach

--

<img src="data:image/png;base64,#img/table-sim-approach.png" width="65%" style="display: block; margin: auto;" />

---

# Simulation Results: Objective Function

--

<img src="data:image/png;base64,#img/table-sim-objfun.png" width="50%" style="display: block; margin: auto;" />

--

<img src="data:image/png;base64,#img/table-sim-metrics.png" width="70%" style="display: block; margin: auto;" />

---

class: inverse middle center subsection

# Statistical Interpretation

---

# Hypothesis Testing

--

.pull-left[
<img src="data:image/png;base64,#img/FNN1.png" width="100%" style="display: block; margin: auto;" />
]

---
count: false

# Hypothesis Testing

.pull-left[
<img src="data:image/png;base64,#img/FNN2.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[
Wald test:

{{content}}
]

--

$$
`\begin{equation}
\omega_j = (\omega_{j1},\omega_{j2},\dotsc,\omega_{jq})^T
\end{equation}`
$$

{{content}}

--

$$
`\begin{equation}
H_0: \omega_j = 0
\end{equation}`
$$

{{content}}

--

$$
`\begin{equation}
(\hat{\omega}_{j} - \omega_j)^T\Sigma_{\hat{\omega}_{j}}^{-1}(\hat{\omega}_{j} - \omega_j) \sim \chi^2_q
\end{equation}`
$$

{{content}}

---

# Simple Covariate Effect

<br>

--

$$
`\begin{equation}
\hat{\tau}_j = E[\text{NN}(X)|x_{(j)} > a] - E[\text{NN}(X)|x_{(j)} < a]
\end{equation}`
$$

<br>

--

Usually set `\(a = m_j\)`, where `\(m_j\)` is the median value of covariate `\(j\)`.

--

<br>

Associated uncertainty via delta method / bootstrapping.

---

# Covariate-Effect Plots

$$
`\begin{equation}
\overline{\text{NN}}_j(x) = \frac{1}{n}\sum_{i=1}^n \text{NN}(x_{(i,1)}, \ldots,x_{(i,j-1)},x, x_{(i,j+1)}, \ldots, x_{(i,p)})
\end{equation}`
$$

--

Propose covariate-effect plots of the following form:

--

$$
`\begin{equation}
\hat{\beta}_j(x,d) = \overline{\text{NN}}_j(x + d) - \overline{\text{NN}}_j(x)
\end{equation}`
$$

--

Usually set `\(d = \text{SD}(x_j)\)`.

--

Associated uncertainty via delta method.

---

class: inverse middle center subsection

# R Implementation

---

# R Implementation

--

.left-column[
<br>
<img src="data:image/png;base64,#img/statnnet.png" width="80%" style="display: block; margin: auto;" />
]

--

.right-column[
<br>
<br>

```r
# install.packages("devtools")
library(devtools)
install_github("andrew-mcinerney/statnnet")
```
]

---

# Data Application (Revisited)

### Boston Housing Data (Kaggle)

506 communities in Boston, MA.

--

Response:

- `medv` (median value of owner-occupied homes)

--

12 Explanatory Variables:

- `rm` (average number of rooms per dwelling)

- `lstat` (proportion of population that are disadvantaged)

---

# Boston Housing: Model Selection

```r
library(statnnet)
nn <- selectnn(medv ~ ., data = Boston, Q = 10, n_init = 10, maxit = 5000)
summary(nn)
```

--

```{.bg-primary}
## Call:
## selectnn.formula(formula = medv ~ ., data = Boston, Q = 10, n_init = 10, 
##     maxit = 5000)
## 
## Number of input nodes: 8 
## Number of hidden nodes: 4 
## 
## Value: -976.5733 
## 
## Inputs:
##   Covariate Selected Delta.BIC
##          rm      Yes   236.907
##       lstat      Yes   168.023
##         dis      Yes   139.305
##         nox      Yes    95.203
##     ptratio      Yes    59.154
##       indus      Yes    38.201
##         rad      Yes    35.825
##        crim      Yes     8.960
##        chas       No   -19.769
##         age       No   -50.105
##          zn       No   -64.266
## 
## [...]
```

---

# Boston Housing: Model Comparison

<img src="data:image/png;base64,#img/modelcom_boston-1.png" width="95%" style="display: block; margin: auto;" />

---

# Boston Housing: Model Comparison

<img src="data:image/png;base64,#img/modelcomp_boston_zoom-1.png" width="95%" style="display: block; margin: auto;" />

---

# Boston Housing: Model Summary

```r
stnn <- statnnet(nn)
summary(stnn)
```

--

```{.bg-primary}
## [...]
## Coefficients:
##                                     Wald           
##           Estimate Std. Error  |      X^2 Pr(> X^2)    
## crim     -0.115769   0.019085  | 109.8369  0.00e+00 ***
## indus    -0.176500   0.018028  |  51.6302  1.65e-10 ***
## nox      -0.163091   0.020639  |  39.4919  5.51e-08 ***
## rm        0.201211   0.017924  |  45.5051  3.12e-09 ***
## dis       0.101701   0.022437  |  14.6031  5.60e-03 ** 
## rad      -0.099667   0.019687  | 107.3354  0.00e+00 ***
## ptratio  -0.192649   0.016672  |   7.8733  9.63e-02 .  
## lstat    -0.263402   0.014443  |  50.2500  3.20e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

# Boston Housing: Simple Effects

<img src="data:image/png;base64,#img/BostonEffects1-1.png" width="90%" style="display: block; margin: auto;" />

---

# Boston Housing: Covariate-Effect Plots

```r
plot(stnn, conf_int = TRUE, method = "deltamethod", which = c(4, 8))
```

--

.pull-left[
<!-- -->
]

--

.pull-right[
<!-- -->
]

---

class: final-slide

# Summary

Feedforward neural networks are non-linear regression models.

--

Calculation of a likelihood function allows for uncertainty quantification.

--

Our R package extends existing neural network packages to allow for a more interpretable, statistically-based output.

---

class: final-slide

# References

Fukumizu, K. (1996). A regularity condition of the information matrix of a multilayer perceptron network. *Neural Networks*, 9(5):871–879.

McInerney, A. and Burke, K. (2022). A statistically-based approach to feedforward neural network model selection. arXiv preprint arXiv:2207.04248.

### R Package

```r
devtools::install_github("andrew-mcinerney/statnnet")
```
<font size="5">andrew-mcinerney</font>
<font size="5">@amcinerney_</font>
<font size="5">andrew.mcinerney@ul.ie</font>